
An automatic indexing technique for Thai texts using frequent max substring

Abstract

Thai is considered a non-segmented language: words are strings of symbols without explicit word boundaries, and the structure of written Thai is highly ambiguous. As a result, indexing has become a main issue in Thai text retrieval. To construct an inverted index for Thai texts, an index term extraction technique is usually required to segment the texts into index terms. Although index terms can be specified manually by experts, this process is very time-consuming and labor-intensive. Word segmentation is one of many techniques used to extract index terms from Thai texts automatically. However, most word segmentation techniques require linguistic knowledge, and preparing these approaches is time-consuming. The n-gram based approach is another automatic index term extraction method that is often used as an indexing technique for Asian languages, including Thai. This approach is language-independent and does not require any linguistic knowledge or dictionary. Although the n-gram approach outperforms many indexing techniques for Asian languages in terms of retrieval effectiveness, its disadvantage is that it requires large storage space and long retrieval time. In this paper we present frequent max substring mining to extract index terms from Thai texts. Our method is language-independent and does not rely on any dictionary or grammatical knowledge of the language. Frequent max substring mining is based on text mining, the process of discovering useful information or knowledge from unstructured texts. The approach analyzes frequent max substring sets to extract all long and frequently occurring substrings. We aim to employ the frequent max substring mining algorithm to address the drawback of the n-gram based approach: by keeping only frequent max substrings, we reduce both the disk space required for storing index terms and the retrieval time, in order to cope with the rapid growth of Thai texts.
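
The contrast between n-gram index terms and frequent max substrings can be illustrated with a small brute-force sketch. It is not the paper's algorithm, which this abstract does not describe. It assumes that a substring is "frequent" when it occurs at least `min_freq` times (overlapping occurrences counted) and "max" when it is not contained in any longer frequent substring; the paper's exact definitions may differ. The function names and the `min_freq` parameter are hypothetical.

```python
from collections import Counter


def char_ngrams(text: str, n: int) -> set[str]:
    """Return the set of distinct character n-grams of `text`."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def frequent_substrings(text: str, min_freq: int = 2) -> dict[str, int]:
    """Count every substring and keep those seen at least `min_freq` times.

    Brute force (quadratic number of substrings); for illustration only.
    """
    counts = Counter(
        text[i:j]
        for i in range(len(text))
        for j in range(i + 1, len(text) + 1)
    )
    return {s: c for s, c in counts.items() if c >= min_freq}


def frequent_max_substrings(text: str, min_freq: int = 2) -> dict[str, int]:
    """Keep only frequent substrings not contained in a longer frequent one.

    Assumed definition of "max"; the paper may define it differently.
    """
    frequent = frequent_substrings(text, min_freq)
    return {
        s: c
        for s, c in frequent.items()
        if not any(s != t and s in t for t in frequent)
    }


if __name__ == "__main__":
    text = "abcabc"
    print(sorted(char_ngrams(text, 2)))        # ['ab', 'bc', 'ca']
    print(frequent_max_substrings(text, 2))    # {'abc': 2}
```

On this toy string the bigram index stores three terms, while the filter keeps the single term `abc`. This only illustrates the space-saving idea; it says nothing about the results reported in the paper.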